Natural language generation

Natural Language Generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations.

In a sense, an NLG system is like a translator that converts a computer-based representation into a natural language representation. However, the methods used to produce the final language are very different from those of a compiler, because of the inherent expressivity of natural languages.

NLG may be viewed as the opposite of natural language understanding. The difference can be put this way: whereas in natural language understanding the system needs to disambiguate the input sentence to produce the machine representation language, in NLG the system needs to make decisions about how to put a concept into words.

The simplest (and perhaps trivial) examples are systems that generate form letters. Such systems do not typically involve grammar rules, but may generate a letter to a consumer, e.g. stating that a credit card spending limit is about to be reached. More complex NLG systems dynamically create texts to meet a communicative goal. As in other areas of natural language processing, this can be done using either explicit models of language (e.g., grammars) and the domain, or using statistical models derived by analysing human-written texts.
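The form-letter case can be sketched as plain template filling. The following is a minimal illustration rather than any deployed system: the letter wording, the field names, and the 90% threshold are all assumptions made for the example.

```python
from string import Template

# Hypothetical canned letter; the wording and field names are illustrative only.
LETTER = Template(
    "Dear $name,\n"
    "You have used $spent of your $limit credit card spending limit "
    "and are about to reach it."
)

def near_limit_letter(name, spent, limit, threshold=0.9):
    """Return a filled-in letter if spending is near the limit, else None."""
    if spent < threshold * limit:
        return None  # spending is not yet close to the limit; send nothing
    return LETTER.substitute(
        name=name,
        spent=f"{spent:,.2f}",
        limit=f"{limit:,.2f}",
    )

print(near_limit_letter("Ms Smith", 1900.0, 2000.0))
```

No grammar rules are involved here: the only decision the program makes is whether to send the letter at all.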

NLG is a fast-evolving field. The best single source for up-to-date research in the area is the SIGGEN portion of the ACL Anthology. Perhaps the closest the field comes to a specialist textbook is Reiter and Dale (2000),[1] but this book does not describe developments in the field since 2000.

Example

The Pollen Forecast for Scotland demo[2] shows a simple NLG system in action. This system takes as input six numbers, which give predicted pollen levels in different parts of Scotland. From these numbers, the system generates a short textual summary of pollen levels as its output.

For example, using the historical data for 1 July 2005, the software produces:

Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country. However, in Northern areas, pollen levels will be moderate with values of 4.

In contrast, the actual forecast (written by a human meteorologist) from this data was:

Pollen counts are expected to remain high at level 6 over most of Scotland, and even level 7 in the south east. The only relief is in the Northern Isles and far northeast of mainland Scotland with medium levels of pollen count.

Comparing these two illustrates some of the choices that NLG systems must make; these are further discussed below.

Stages

The process of generating text can be as simple as keeping a list of canned text that is copied and pasted, possibly linked with some glue text. The results may be satisfactory in simple domains such as horoscope machines or generators of personalised business letters. However, a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive. Typical stages are:

Content determination: Deciding what information to mention in the text. For instance, in the pollen example above, deciding whether to explicitly mention that pollen level is 7 in the south east.

Document structuring: Overall organisation of the information to convey. For example, deciding to describe the areas with high pollen levels first, instead of the areas with low pollen levels.

Aggregation: Merging of similar sentences to improve readability and naturalness. For instance, merging the two sentences "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday" and "Grass pollen levels will be around 6 to 7 across most parts of the country" into the single sentence "Grass pollen levels for Friday have increased from the moderate to high levels of yesterday with values of around 6 to 7 across most parts of the country".

Lexical choice: Putting words to the concepts. For example, deciding whether "medium" or "moderate" should be used when describing a pollen level of 4.

Referring expression generation: Creating referring expressions that identify objects and regions. For example, deciding to use "in the Northern Isles and far northeast of mainland Scotland" to refer to a certain region in Scotland. This task also includes making decisions about pronouns and other types of anaphora.

Realisation: Creating the actual text, which should be correct according to the rules of syntax, morphology, and orthography. For example, using "will be" for the future tense of "to be".
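The stages above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not the Pollen Forecast system itself: the region names, the numeric bands behind "high", "moderate" and "low", and the sentence pattern are all invented for the example. Referring expression generation is not modelled, since the region names are simply supplied as strings.

```python
# Toy pipeline over the stages listed above. Region names, level bands, and
# wording are assumptions for illustration only.

def describe_level(value):
    """Lexical choice: map a numeric pollen level to a word (assumed bands)."""
    if value >= 6:
        return "high"
    if value >= 4:
        return "moderate"
    return "low"

def determine_content(levels):
    """Content determination: group regions by the word chosen for their level."""
    groups = {}
    for region, value in levels.items():
        groups.setdefault(describe_level(value), []).append((region, value))
    return groups

def structure_document(groups):
    """Document structuring: mention the highest levels first."""
    order = ["high", "moderate", "low"]
    return [(word, groups[word]) for word in order if word in groups]

def join_names(names):
    """Realisation helper: 'A', 'A and B', or 'A, B and C'."""
    if len(names) == 1:
        return names[0]
    return ", ".join(names[:-1]) + " and " + names[-1]

def realise(word, members):
    """Aggregation and realisation: one sentence per group of regions."""
    regions = join_names([region for region, _ in members])
    values = sorted({value for _, value in members})
    if len(values) == 1:
        value_text = f"a value of {values[0]}"
    else:
        value_text = f"values of {values[0]} to {values[-1]}"
    return f"Pollen levels will be {word} in {regions}, with {value_text}."

def generate(levels):
    """Run the stages in order and join the resulting sentences."""
    plan = structure_document(determine_content(levels))
    return " ".join(realise(word, members) for word, members in plan)

forecast = generate({"the south east": 7, "the central belt": 6, "the north": 4})
print(forecast)
```

Keeping each stage as its own function mirrors the list above, although real systems often interleave the stages rather than running them strictly in sequence.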

Applications

The popular media has been especially interested in NLG systems which generate jokes (see computational humor). But from a commercial perspective, the most successful NLG applications have been data-to-text systems which generate textual summaries of databases and data sets; these systems usually perform data analysis as well as text generation. In particular, several systems have been built that produce textual weather forecasts from weather data. The earliest such system to be deployed was FoG,[3] which was used by Environment Canada to generate weather forecasts in French and English in the early 1990s. The success of FoG triggered other work, both research and commercial. Recent research in this area includes an experiment which showed that users sometimes preferred computer-generated weather forecasts to human-written ones, in part because the computer forecasts used more consistent terminology,[4] and a demonstration that statistical techniques could be used to generate high-quality weather forecasts.[5] Recent applications include the ARNS system used to summarise conditions in US ports.

In the 1990s there was considerable interest in using NLG to summarise financial and business data. For example, the SPOTLIGHT system developed at A.C. Nielsen automatically generated readable English text based on the analysis of large amounts of retail sales data.[6] More recently there has been growing interest in using NLG to summarise electronic medical records. Commercial applications in this area are starting to appear,[7] and researchers have shown that NLG summaries of medical data can be effective decision-support aids for medical professionals.[8] There is also growing interest in using NLG to enhance accessibility, for example by describing graphs and data sets to blind people.

An example of a highly interactive use of NLG is the WYSIWYM framework. The name stands for "What you see is what you meant". The framework allows users to see and manipulate the continuously rendered view (NLG output) of an underlying formal language document (NLG input), thereby editing the formal language without having to learn it.

Evaluation

As in other scientific fields, NLG researchers need to be able to test how well their systems, modules, and algorithms work. This is called evaluation. There are three basic techniques for evaluating NLG systems: task-based evaluation, human ratings, and metrics.

Generally speaking, what we ultimately want to know is how useful NLG systems are at helping people, which is what task-based evaluation measures. However, task-based evaluations are time-consuming and expensive, and can be difficult to carry out (especially if they require subjects with specialised expertise, such as doctors). Hence (as in other areas of NLP) task-based evaluations are the exception, not the norm.

In recent years researchers have started trying to assess how well human ratings and metrics correlate with (predict) task-based evaluations. Much of this work is being conducted in the context of Generation Challenges shared-task events. Initial results suggest that human ratings are much better than metrics in this regard. In other words, human ratings usually do predict task-effectiveness at least to some degree (although there are exceptions[9]), while ratings produced by metrics often do not predict task-effectiveness well. These results are very preliminary; better data will hopefully be available soon. In any case, human ratings are currently the most popular evaluation technique in NLG; this is in contrast to machine translation, where metrics are very widely used.
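One common way to quantify how well one kind of score predicts another is a correlation coefficient. The sketch below computes Pearson's r in plain Python. The choice of Pearson's r is an assumption made for illustration (the studies referred to above may use other statistics), and the numbers in the usage lines are invented solely to show the call, so they carry no evidential weight.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)

# Invented scores for five hypothetical systems (not real results).
task_scores = [0.5, 0.6, 0.7, 0.8, 0.9]    # e.g. task success rate per system
human_ratings = [2.0, 3.0, 3.0, 4.0, 5.0]  # e.g. mean rating per system
print(round(pearson(human_ratings, task_scores), 3))
```

A value near 1 would indicate that systems rated highly also tend to do well on the task, while a value near 0 would indicate that the ratings say little about task-effectiveness.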

References

  1. Dale, Robert; Reiter, Ehud (2000). Building natural language generation systems. Cambridge, UK: Cambridge University Press. ISBN 0-521-02451-X.
  2. Turner R, Sripada S, Reiter E, Davy I (2006). "Generating Spatio-Temporal Descriptions in Pollen Forecasts". Proceedings of EACL06.
  3. Goldberg E, Driedger N, Kittredge R (1994). "Using Natural-Language Processing to Produce Weather Forecasts". IEEE Expert 9 (2): 45–53. doi:10.1109/64.294135.
  4. Reiter E, Sripada S, Hunter J, Yu J, Davy I (2005). "Choosing Words in Computer-Generated Weather Forecasts". Artificial Intelligence 167: 137–69. doi:10.1016/j.artint.2005.06.006.
  5. Belz A (2008). "Automatic Generation of Weather Forecast Texts Using Comprehensive Probabilistic Generation-Space Models". Natural Language Engineering 14: 431–55.
  6. Anand, Tej; Kahn, Gary (1992). "Making Sense of Gigabytes: A System for Knowledge-Based Market Analysis". In Klahr, Philip; Scott, A. F. Innovative applications of artificial intelligence 4: proceedings of the IAAI-92 Conference. Menlo Park, Calif: AAAI Press. pp. 57–70. ISBN 0-262-69155-8. http://www.aaai.org/Papers/IAAI/1992/IAAI92-006.pdf.
  7. Harris MD (2008). "Building a Large-Scale Commercial NLG System for an EMR". Proceedings of the Fifth International Natural Language Generation Conference. pp. 157–60. http://www.aclweb.org/anthology/W08-1120.pdf.
  8. Portet F, Reiter E, Gatt A, Hunter J, Sripada S, Freer Y, Sykes C (2009). "Automatic Generation of Textual Summaries from Neonatal Intensive Care Data". Artificial Intelligence 173 (7–8): 789–816. doi:10.1016/j.artint.2008.12.002.
  9. Law A, Freer Y, Hunter J, Logie R, McIntosh N, Quinn J (2005). "A Comparison of Graphical and Textual Presentations of Time Series Data to Support Medical Decision Making in the Neonatal Intensive Care Unit". Journal of Clinical Monitoring and Computing 19 (3): 183–94. doi:10.1007/s10877-005-0879-3. PMID 16244840.
